Abstract

This paper examines the use of a residual bootstrap for bias correction in machine learning
regression methods. Accounting for bias is an important obstacle in recent efforts to
develop statistical inference for machine learning methods. We demonstrate empirically that 
the proposed bootstrap bias correction can lead to substantial improvements in both bias and 
predictive accuracy. In the context of ensembles of trees, we show that this correction can be 
approximated at only double the cost of training the original ensemble without introducing 
additional variance. Our method is shown to improve test-set accuracy over random forests by 
up to 70% on example problems from the UCI repository. 


1 Introduction 

This paper proposes a bootstrap-based means of correcting bias in ensemble methods in Machine 
Learning. In non-parametric predictive modeling, accuracy is obtained by a trade-off between bias 
and variance. However, until recently, little attention has been given to quantifying either of these 
quantities. Very recently, tools have been developed to quantify the variance in random forests
(RF) [5] and other ensemble methods such as bagging [4]. These papers developed central limit
theorems for the predictions of ensemble methods with a variance that scales as $O(\min(n, B)^{-1})$
(see Section 4). These results follow heuristic means of producing confidence intervals, and related
work examined tests of variable importance and variable interaction. However, such confidence
intervals and tests provide inference around the expected value of the prediction, rather than the
expected value of a new observation - that is, they neither quantify nor correct for bias unless that
bias decreases faster than the corresponding standard error.
It is important to note that while variants on RF have been shown to have consistent
predictions [5], when prediction accuracy is targeted, the bias in a prediction is generally as large
as the standard error, meaning that the inferential procedures so far developed must be interpreted
carefully.

This paper presents a method to decrease the bias of ensemble methods via a residual bootstrap. 
Bias correction via the bootstrap has a substantial history; although it does not reduce the
order of the bias in kernel smoothing except at the edges of covariate space, it can still yield 
substantial performance improvements. It also provides an opportunity to improve prediction - 
while many of the papers cited above quantify variance in predictions, none reduce it. By contrast, 
the methods we present below can yield a substantial improvement in predictive performance for 
regression problems.

The use of a residual bootstrap in non-parametric regression has been examined previously; however,
its direct application to machine learning methods has been hampered by the computational
complexity involved in re-fitting a prediction model over B bootstrap replicates. We demonstrate 
here that in the context of ensemble methods, an approximate residual bootstrap can be computed 
at the same additional cost as computing only one - rather than B - additional predictive models. 
We further provide an analysis of the variance associated with conducting it. In simulation and on 
example data, this bias correction not only significantly reduces bias, it can also result in dramatic 
improvements in predictive accuracy for regression problems. 


2 The Bootstrap and Bias Corrections 

The bootstrap was introduced with the aim of assessing variability in statistics when a
theoretical value is either unknown or not estimable. It also presents a means of correcting for 
some forms of bias. The idea is simply to simulate from the empirical distribution of the data 
(i.e. resample with replacement) as a means of constructing an approximation of the sampling 
distribution of the statistic. This is expressed as: 

For a data set $X_1, \ldots, X_n$ and a statistic of interest $T(X_1, \ldots, X_n)$:

• For $b$ from 1 up to $B$

1. Form a bootstrap sample $X_1^b, \ldots, X_n^b$ by resampling $X_1, \ldots, X_n$ with replacement.

2. Calculate $T^b = T(X_1^b, \ldots, X_n^b)$

• Treat $T^1, \ldots, T^B$ as a sample from the sampling distribution of $T(X_1, \ldots, X_n)$ and
in particular obtain

— Estimates of the sampling variance of $T(X_1, \ldots, X_n)$ and

— A correction to potential bias

$$T^c = T(X_1, \ldots, X_n) - \left( \frac{1}{B} \sum_{b=1}^{B} T^b - T(X_1, \ldots, X_n) \right) = 2 T(X_1, \ldots, X_n) - \frac{1}{B} \sum_{b=1}^{B} T^b$$

An analysis of the asymptotic properties of the bootstrap can be found in [12] among many others. 
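
The bootstrap bias correction above can be sketched in a few lines. The following is a minimal illustration rather than the implementation used in this paper: the function names are hypothetical, and the plug-in variance (divisor $n$) is chosen only because its downward bias is well known.

```python
import random
from statistics import fmean

def bootstrap_bias_correct(data, statistic, B=1000, rng=None):
    """Return (T, T_c): the statistic and its bootstrap bias-corrected value."""
    rng = rng or random.Random()
    n = len(data)
    t_hat = statistic(data)
    # T^b for b = 1..B: the statistic recomputed on resamples drawn with replacement
    t_boot = [statistic([data[rng.randrange(n)] for _ in range(n)]) for _ in range(B)]
    # T^c = 2 T - (1/B) sum_b T^b
    return t_hat, 2.0 * t_hat - fmean(t_boot)

def plugin_variance(xs):
    """Variance with divisor n, which is biased downward by a factor (n - 1) / n."""
    m = fmean(xs)
    return fmean((x - m) ** 2 for x in xs)

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(50)]   # illustrative data only
t_hat, t_corr = bootstrap_bias_correct(sample, plugin_variance, B=5000, rng=rng)
print(t_hat, t_corr)
```

Because the bootstrap replicates of the plug-in variance are on average smaller than the original value, the corrected value is pushed upward, in the direction of the unbiased estimate.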

There is an immediate connection between the bootstrap as detailed above and the bagging 
methods proposed in [3] and used also in RF [3] - the statistic in question being expressed as the map 
from a training set to the prediction of a single tree. However, since these methods already employ 
a bootstrap procedure, bootstrapping them again would represent a considerable burden. While 
the bootstrap standard deviation is a consistent estimate of the variability of $T(X_1, \ldots, X_n)$, it does
not estimate the variance of the bagged average $\frac{1}{B} \sum_{b=1}^{B} T^b$. For this reason both [1] and [5] employed subsampling
rather than full bootstrap sampling, which enables a variance calculation by extending results for
U-statistics and the infinitesimal jackknife.

The fact that these methods already contain a bootstrap procedure means that the bias correction
above - bootstrapping a bagged estimate - cannot be expected to perform well. Instead,
we propose employing a residual bootstrap. This is a modified bootstrap for regression
models of the form:

$$Y_i = F(X_i) + \epsilon_i$$

in which $F$ (specified parametrically or non-parametrically) is the object of interest. For this model,
sampling from the residual bootstrap can be expressed, following an estimate $\hat{F}(X_i)$, as

1. Obtain residuals $\hat{\epsilon}_i = Y_i - \hat{F}(X_i)$

2. Obtain new responses by bootstrapping these residuals

$$Y_i^* = \hat{F}(X_i) + \epsilon_i^*,$$

with the pairs $(X_i, Y_i^*)$ employed to create a new estimate $\hat{F}^*$. In the context of nonparametric
regression, bias and variance estimates for kernel smoothing have been examined; the coverage of
confidence intervals was examined in [15]. There are numerous variants on this procedure; for example,
the $\hat{\epsilon}_i$ can be centered and inflated to adjust for the optimism in $\hat{F}$.
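
The two steps above can be sketched as follows. This is an illustrative sketch only: `knn_fit` is a hypothetical stand-in for whatever estimator of $F$ is in use, the residuals are centered (one of the variants mentioned above), and the data are simulated purely for demonstration.

```python
import random
from statistics import fmean

def knn_fit(xs, ys, k=5):
    """Hypothetical stand-in estimator: k-nearest-neighbour average in one dimension."""
    pts = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pts, key=lambda p: abs(p[0] - x))[:k]
        return fmean(y for _, y in nearest)
    return predict

def residual_bootstrap(xs, ys, fit, n_boot, rng):
    """Yield one refitted predictor F* per residual-bootstrap replicate."""
    f_hat = fit(xs, ys)
    fitted = [f_hat(x) for x in xs]
    resid = [y - f for y, f in zip(ys, fitted)]            # step 1: residuals
    mean_r = fmean(resid)
    resid = [r - mean_r for r in resid]                    # centring variant
    for _ in range(n_boot):
        y_star = [f + rng.choice(resid) for f in fitted]   # step 2: new responses
        yield fit(xs, y_star)

rng = random.Random(1)
xs = [i / 99 for i in range(100)]
ys = [x + rng.gauss(0.0, 0.1) for x in xs]
f_hat = knn_fit(xs, ys)
boot = list(residual_bootstrap(xs, ys, knn_fit, n_boot=50, rng=rng))
x0 = 0.0                                                   # edge of the covariate range
bias_est = fmean(f(x0) for f in boot) - f_hat(x0)          # estimated bias at x0
corrected = f_hat(x0) - bias_est                           # equals 2 F_hat - mean F*
print(f_hat(x0), corrected)
```

Note that every replicate requires a full refit of the estimator, which is the cost that the shortened procedure for ensembles is designed to avoid.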

While this paper is focussed on regression methodologies, classification can be handled by replacing
the bootstrap sample of residuals with a simulation from $P(Y_i = 1 | X_i)$ according to the
model - the parametric bootstrap.
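
For the classification case, the resampling step might look like the sketch below. The function name and the example probabilities are hypothetical; the only assumption is that the fitted model supplies an estimate of $P(Y_i = 1 | X_i)$ for each observation.

```python
import random

def parametric_bootstrap_labels(probs, rng):
    """Draw Y_i* ~ Bernoulli(p_i) for estimated probabilities p_i = P(Y_i = 1 | X_i)."""
    return [1 if rng.random() < p else 0 for p in probs]

rng = random.Random(0)
p_hat = [0.05, 0.5, 0.95]   # hypothetical fitted probabilities
print(parametric_bootstrap_labels(p_hat, rng))
```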

In the next section, we outline a residual bootstrap that can be applied efficiently to ensemble 
methods. 


3 A Cheap Residual Bootstrap for Ensembles 

The naive implementation of a residual bootstrap methodology for RF and other ensemble methods 
requires recomputing the ensemble B times; one for each bootstrap. Here we show that this is 
unnecessarily computationally intensive if we are only interested in obtaining a bias correction (see 
[1] for variance estimates). 

The key here is that, rather than learning an entirely new RF for each residual bootstrap, we
can simply learn a single new tree. To make this formal, we take $T_x((X_1, Y_1), \ldots, (X_n, Y_n), \omega)$ to
be the function that builds a tree from the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ using random number seed
$\omega$ and makes a prediction at the point $x$. A prediction from a RF can then be expressed as

$$\hat{F}_B(x) = \frac{1}{B} \sum_{b=1}^{B} T_x\left((X_{I_{b1}}, Y_{I_{b1}}), \ldots, (X_{I_{bn}}, Y_{I_{bn}}), \omega_b\right)$$

and an estimate of residuals can be obtained by examining the out-of-bag predictions. That is, we
denote by $I_b = \{I_{b1}, \ldots, I_{bn}\}$ the set of indices of the observations that occur in bootstrap sample $b$, then define

$$\hat{\epsilon}_i^o = Y_i - \frac{1}{|\{b : i \notin I_b\}|} \sum_{b : i \notin I_b} T_{X_i}\left((X_{I_{b1}}, Y_{I_{b1}}), \ldots, (X_{I_{bn}}, Y_{I_{bn}}), \omega_b\right) \qquad (1)$$

to be the residuals calculated from the trees which were not trained using $Y_i$. We can now use these
as being the equivalent of inflated residuals in a residual bootstrap.

In order to assess the bias in this estimate, we construct a shortened residual bootstrap according
to the following algorithm:

1. Obtain residuals $\hat{\epsilon}_i^o$

2. For $b$ from 1 to $B_o$

(a) Obtain a bootstrap sample of residuals $\epsilon_i^{ob}$ and form new response values $Y_i^{ob} = \hat{F}_B(X_i) + \epsilon_i^{ob}$

(b) Build a tree using a bootstrap sample of the data pairs $T_x\left((X_{I_{b1}}, Y^{ob}_{I_{b1}}), \ldots, (X_{I_{bn}}, Y^{ob}_{I_{bn}}), \omega_b\right)$

3. Return $\hat{F}^o_{B_o}(x) = \frac{1}{B_o} \sum_{b=1}^{B_o} T_x\left((X_{I_{b1}}, Y^{ob}_{I_{b1}}), \ldots, (X_{I_{bn}}, Y^{ob}_{I_{bn}}), \omega_b\right)$

This estimate requires building only $B_o$ trees, rather than the $B B_o$ required in a naive implementation.

Following this, we can construct a bias-corrected estimate from

$$\hat{F}^c_{B,B_o}(x) = 2 \hat{F}_B(x) - \hat{F}^o_{B_o}(x).$$

We label this the bias-corrected Random Forest (RFc). Note that while we are able to cheaply
assess bias in this manner, the collection of trees $T_x\left((X_{I_{b1}}, Y^{ob}_{I_{b1}}), \ldots, (X_{I_{bn}}, Y^{ob}_{I_{bn}}), \omega_b\right)$ does not allow us to assess
variance. For this we can employ the methods proposed in [1].
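
The shortened residual bootstrap can be sketched end to end as below. This is an illustration under stated assumptions, not the code used for the experiments: `fit_tree` is a hypothetical one-dimensional greedy regression tree standing in for CART or RF trees, all function names are invented for the example, and an observation that happens to be in every bag falls back to the full ensemble when its out-of-bag residual is computed.

```python
import random
from statistics import fmean

def fit_tree(xs, ys, min_leaf=5):
    """Greedy one-dimensional regression tree; returns a function x -> prediction."""
    pts = sorted(zip(xs, ys))

    def build(lo, hi):
        n = hi - lo
        total = sum(y for _, y in pts[lo:hi])
        best_score, best_k, left = None, None, 0.0
        for k in range(1, n):
            left += pts[lo + k - 1][1]
            if k < min_leaf or n - k < min_leaf:
                continue
            if pts[lo + k - 1][0] == pts[lo + k][0]:
                continue  # cannot split between tied x values
            # maximising this is equivalent to minimising the within-leaf squared error
            score = left * left / k + (total - left) ** 2 / (n - k)
            if best_score is None or score > best_score:
                best_score, best_k = score, k
        if best_k is None:
            return total / n  # leaf: mean response
        cut = 0.5 * (pts[lo + best_k - 1][0] + pts[lo + best_k][0])
        return (cut, build(lo, lo + best_k), build(lo + best_k, hi))

    root = build(0, len(pts))

    def predict(x):
        node = root
        while isinstance(node, tuple):
            cut, lower, upper = node
            node = lower if x <= cut else upper
        return node

    return predict

def bag(xs, ys, n_trees, rng):
    """Bagged trees plus, for each tree, the set of in-bag indices I_b."""
    n, trees, in_bag = len(xs), [], []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx]))
        in_bag.append(set(idx))
    return trees, in_bag

def ensemble_predict(trees, x):
    return fmean(t(x) for t in trees)

def oob_residuals(xs, ys, trees, in_bag):
    """Equation (1): residuals from trees whose bootstrap sample excluded observation i."""
    out = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        oob = [t for t, bag_i in zip(trees, in_bag) if i not in bag_i] or trees
        out.append(y - ensemble_predict(oob, x))
    return out

def bias_corrected_forest(xs, ys, B, B_o, rng):
    trees, in_bag = bag(xs, ys, B, rng)
    resid = oob_residuals(xs, ys, trees, in_bag)             # step 1
    fitted = [ensemble_predict(trees, x) for x in xs]
    n, boot_trees = len(xs), []
    for _ in range(B_o):                                     # step 2
        y_new = [f + rng.choice(resid) for f in fitted]      # (a) new responses
        idx = [rng.randrange(n) for _ in range(n)]           # (b) one tree per replicate
        boot_trees.append(fit_tree([xs[i] for i in idx], [y_new[i] for i in idx]))
    f_b = lambda x: ensemble_predict(trees, x)
    f_o = lambda x: ensemble_predict(boot_trees, x)          # step 3
    return f_b, f_o, (lambda x: 2.0 * f_b(x) - f_o(x))       # bias-corrected estimate

rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [-(x - 0.5) ** 2 + rng.gauss(0.0, 0.1) for x in xs]
f_b, f_o, f_c = bias_corrected_forest(xs, ys, B=25, B_o=25, rng=rng)
print(f_b(0.5), f_o(0.5), f_c(0.5))
```

In total only $B + B_o$ trees are grown here, since each residual-bootstrap replicate contributes a single tree rather than a whole ensemble.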


4 Computational Costs and Theoretical Properties 

[1] demonstrated that under mild regularity conditions, predictions from random forests built using
subsamples of size $m = o(\sqrt{n})$ out of $n$ examples have the following central limit theorem

$$\frac{\hat{F}_B(x) - E \hat{F}_B(x)}{\sqrt{\frac{m^2}{n} \zeta_1(x) + \frac{1}{B} \zeta_m(x)}} \xrightarrow{d} N(0, 1) \qquad (2)$$

in which $\zeta_1(x)$ and $\zeta_m(x)$ have known expressions. For an idealization of RF, [6] relaxed this
condition to allow $m = o(n / \log(n)^p)$ in the case that $n/B \to 0$ where $p$ is the dimension of $x$.

We see here that the variance in this central limit theorem is $O(\min(n, B)^{-1})$. [TB] identifies
the two terms as the distinction between infinite RF's (in which $B = \infty$) and their Monte Carlo
approximation used in practice. Following this approach, we identify an infinite bootstrap $\hat{F}^c_\infty$
achieved by setting $B_o = B = \infty$. From the law of large numbers, we can equivalently think of
$\hat{F}^c_\infty(x)$ as the expectation of $\hat{F}^c_{B,B_o}(x)$ taken over all randomization elements, including the selection
of bootstrap samples. In this framework we obtain the following uniform convergence rate, the
proof of which is given in the supplementary materials:

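Theorem 4.1. Let $Y_i = F(X_i) + \epsilon_i$, $\epsilon_i \sim N(0, \sigma^2)$ and let $\|F\|_\infty$, the supremum of $F$ on the
support of $X$, be finite. Then

$$E \left( \hat{F}^c_{B,B_o}(x) - \hat{F}^c_\infty(x) \right)^2 \le \left( \frac{c_1}{B} + \frac{c_2}{B_o} \right) \left[ \|F\|_\infty^2 + \sigma^2 (1 + 4 \log(n)) \right]$$

for numerical constants $c_1$ and $c_2$.

Thus, so long as $B_o$ is of at least the same order as $B$, the variance associated with employing a reduced number of
residual bootstraps can be ignored asymptotically. In practice, we have used $B_o = B$ or $B_o = 2B$
and found our results insensitive to this choice; hence the bias correction may be made at no more
than the same cost as obtaining the original ensemble.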
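
As a rough worked count, take the settings reported for the case studies ($B = 1000$, $B_o = 2000$) and assume that cost is proportional to the number of trees grown. The shortened procedure and a naive residual bootstrap that refits a full ensemble for every replicate then compare as:

```latex
\underbrace{B + B_o}_{\text{shortened}} = 1000 + 2000 = 3000,
\qquad
\underbrace{B + B \, B_o}_{\text{naive}} = 1000 + 1000 \times 2000 = 2{,}001{,}000 .
```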

We remark here that the $\log(n)$ factor is a consequence of a bound on $E \max_i \epsilon_i^2$ and conjecture
that it is not sharp. Theorem 4.1 can be readily extended to a condition that the $\epsilon_i$ have sub-Gaussian
tails. Furthermore, when the $Y_i$ are bounded, we can replace $\log(n)$ with a constant.
Similar rates can be shown to hold in the case of a parametric bootstrap in which $Y_i$ is simulated
according to $\hat{F}_B(X_i)$.

Note here that this calculation does not include the variance associated with the infinite bootstrap
bias correction. That is, even with an idealized $B = B_o = \infty$, $\hat{F}^c_\infty$ will still have some variance
and may be more variable than $\hat{F}_\infty$ - we observe about 50% additional variance for our bias-corrected
estimates in the examples below. A central limit theorem for $\hat{F}^c$ of the form of (2) can
be obtained from an extension of these results using 2-sample U-statistics, and in particular has variance of
order $O(\min(n, B)^{-1})$, hence maintaining our calculations. However, formal inference for $\hat{F}^c$ also
needs to account for the correlation between $\hat{F}_B$ and $\hat{F}^o_{B_o}$ and is beyond the scope of this paper.


5 Numerical Experiments 


In this section we present simulation experiments to examine the effect of bootstrap bias corrections.
An advantage of employing simulated data is that bias can be evaluated explicitly instead of
resorting to predictive accuracy which confounds bias and variance.

As a first study, Figure 1 presents the results of employing a bagged decision tree using only
one covariate generated uniformly on the interval [0, 1]. One dimensional examples were produced
in order to visualize the effect of bias at the edges of the data. We examined two response models:

$$y_i = x_i + \epsilon_i, \qquad y_i = -(x_i - 0.5)^2 + \epsilon_i$$

which have different bias properties. In each case, $\epsilon_i$ was generated from a Gaussian distribution
with standard deviation 0.1. We used 1000 observations and built 1000 trees - intended to be
enough to reduce variance due to subsampling at any test size. We present results for providing
trees with subsamples of size 20 and of size 200. Above each figure we report

Bias Imp The percentage improvement in squared bias over an uncorrected estimate, averaged 
over 100 test points with the same distribution as the training data. Bias was calculated by 
the difference between the prediction averaged over all simulations at each point and the true 
prediction function. 

Pred Imp The percentage improvement in squared error between predictions for each simulation
and the true prediction function. Note that this is a measure for noiseless observations at new 
data points. We would expect this to decrease if noise were added to the test set responses. 


Var Ratio The ratio of the variance of bias-corrected predictions to the variance of uncorrected 
predictions. 
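
The three summaries can be computed from a table of simulated predictions as in the sketch below. The function and argument names are hypothetical, and the sketch assumes that each improvement is reported as a fraction, one minus the ratio of the corrected to the uncorrected quantity, which matches the scale of the values reported in the tables.

```python
from statistics import fmean, variance

def summarize(pred_unc, pred_cor, truth):
    """pred_*[s][j]: prediction of simulation s at test point j; truth[j]: true function value."""
    def columns(p):
        return list(zip(*p))
    def sq_bias(p):
        # squared difference between the simulation-averaged prediction and the truth
        return fmean((fmean(col) - t) ** 2 for col, t in zip(columns(p), truth))
    def mse(p):
        return fmean((v - t) ** 2 for row in p for v, t in zip(row, truth))
    def avg_var(p):
        # pointwise variance across simulations, averaged over test points
        return fmean(variance(col) for col in columns(p))
    return {
        "Bias Imp": 1.0 - sq_bias(pred_cor) / sq_bias(pred_unc),
        "Pred Imp": 1.0 - mse(pred_cor) / mse(pred_unc),
        "Var Ratio": avg_var(pred_cor) / avg_var(pred_unc),
    }

# toy inputs: two simulations, two test points, true value 0 at both points
print(summarize([[1.0, 1.0], [3.0, 3.0]], [[0.0, 0.0], [2.0, 2.0]], [0.0, 0.0]))
```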


For one dimensional simulations, the bias correction we propose helps to reduce bias, but this 
may come at the cost of an increase in variance and hence a reduction in over-all accuracy. This is 
particularly true for large subsample sizes in which bias is less important. 

However, we are rarely interested in one-dimensional prediction. We also expect bias to be 
larger in higher dimensions and we therefore experimented with a 10 dimensional model. For this 
simulation, 5000 examples were generated from a 10-dimensional Gaussian model with variance 
1.8 for each feature and covariance 0.8 between each feature pair. Here again, two models were 
considered: 

10 


y* = 


\ 






4/10 


(3) 


1=1 

[Figure 1: four panels, titled in sequence: Bias Imp 0.29, Pred Imp 0.23, Var Ratio 1.26; Bias Imp 0.25, Pred Imp 0.22, Var Ratio 1.08; Bias Imp 0.48, Pred Imp -0.44, Var Ratio 1.47; Bias Imp 0.44, Pred Imp -0.34, Var Ratio 1.35.]

Figure 1: Effect of bias correction in 1 dimensional bagged trees. Dashed lines provide exact
relationship. Dotted: 5%, 95% and mean values of predictions from bagged trees. Solid: 5%, 95%
and mean values of predictions from bias-corrected bagged trees. Top row: based on subsamples of
size 20; bottom row: subsamples of size 200.


in which the $\epsilon_i$ have standard deviation 0.1. For each simulated data set we built 1,000 trees using
subsamples of sizes 500 and 5000 (the latter being bootstrap samples) using both CART and RF
trees. In Table 1 we report the statistics described above. For CART trees we see a 35% to 55%
reduction in bias as well as a 20% to 50% reduction in prediction error, representing a significant
improvement in both; the improvement is similar for RF except for large bags in the second case
where we still achieve a 10% reduction. We note that while using larger subsample sizes results in
smaller (though still significant) improvement, they can compromise the distributional results on
which the inferential procedures discussed above rely.
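
The covariate design for the 10-dimensional study (variance 1.8 per feature, covariance 0.8 between each pair) can be simulated with a single shared factor, as sketched below. The function name is hypothetical, and a zero mean is assumed since the mean is not stated above.

```python
import math
import random

def simulate_design(n, p=10, common_var=0.8, rng=None):
    """Rows of p Gaussian features with variance 1 + common_var and pairwise covariance common_var."""
    rng = rng or random.Random()
    rows = []
    for _ in range(n):
        # one factor shared by all features induces the common covariance
        shared = math.sqrt(common_var) * rng.gauss(0.0, 1.0)
        rows.append([shared + rng.gauss(0.0, 1.0) for _ in range(p)])
    return rows

X = simulate_design(5000, rng=random.Random(0))
```

Each feature is the sum of the shared factor (variance 0.8) and an independent standard normal, giving variance 1.8 and covariance 0.8 as required.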

We should note that we must expect the use of this bias correction to have much more limited
effect on prediction accuracy for classification problems. This is because there is higher relative
variance in these problems, overwhelming the bias improvement. Moreover, for prediction accuracy,
we need only determine the classification boundary, making bias correction elsewhere useless.
Table 2 reports the results of classification experiments analogous to those above. For each simulation
setting, the true probability was a logistic transform of a scaled and shifted version of the
response function used in the regression models. The bias correction was obtained by generating
new responses for each tree according to the estimated probability from the original ensemble. We
measured both squared-error accuracy in terms of ability to fit the true response and improvement
in misclassification risk. Here we see mixed results for improvement in estimating the underlying
probability. Unsurprisingly the effect on misclassification rate is negligible. We note that while this
correction may not be useful for predictive classification, it may still be desirable when the target
is scientific inference.

Function        Subsample  Type  Bias Imp  Pred Imp  Var Ratio
first in (3)    500        BT    0.55      0.51      1.45
                500        RF    0.54      0.48      2.01
                5000       BT    0.35      0.2       1.29
                5000       RF    0.25      0.11      1.46
second in (3)   500        BT    0.37      0.35      1.43
                500        RF    0.54      0.52      1.73
                5000       BT    0.35      0.36      1.08
                5000       RF    0.24      0.11      1.46

Table 1: Performance of bootstrap bias correction in 10-dimensional regression examples. Functions
are given in (3) and models are averages of 1,000 trees based on 5,000 data points using subsamples
of size 500 or 5,000 for each tree. Ensemble type is either bagged trees (BT) or Random Forests
(RF). Bias Imp = percent improvement in squared bias. Pred Imp = percentage improvement
in mean squared error. Var Ratio = ratio of averaged pointwise variances between corrected and
uncorrected decision trees.


Dimension  Subsample  logit(P(y = 1))             Bias Imp  Pred Imp  Var Rat  Miss Imp
1          20         3(x - 1/2)                  0.2       0.07      1.4      -0.001
           200                                    0.54      -0.45     1.53     -0.008
           20         -30(x - 1/2)^2 - 2.17       0.33      0.32      1.26     0.06
           200                                    0.71      0.04      1.68     -0.011
10         500        5(sum_i |x_i|)^(...) - 5    0.52      -0.01     2.17     -0.005
           5000                                   0.37      -0.64     1.95     -0.02
           500        -2 sum_i x_i^(...) + 2.4    0.42      0.33      1.94     0.01
           5000                                   0.42      -0.1      1.87     -0.013


Table 2: Performance of bootstrap bias correction for simulation of classification tasks. Column
headings are as in Table 1; in addition, Miss Imp = relative misclassification improvement on test
data.

Data set                   N      P   Var(Y)    RF.Err  RFc.Err  RF.Imp  RFc.Imp
airfoil [19]               1503   5   46.95     12.56   7.29     0.73    0.42
auto-mpg [20]              392    7   61.03     7.45    7.02     0.88    0.06
BikeSharing-hour [21]      17379  14  32913.74  55.28   36.6     1       0.34
ccpp [22]                  9568   4   291.36    10.8    9.92     0.96    0.08
communities [23]           1994   96  0.05      0.02    0.02     0.66    -0.01
Concrete [24]              1030   8   279.08    27.24   18.94    0.9     0.3
housing [25]               506    13  0.17      0.02    0.02     0.88    0.09
parkinsons [26]            5875   16  66.14     42.01   40.79    0.36    0.03
SkillCraft [27]            3338   18  2.1       0.84    0.84     0.6     -0.01
winequality-white [28]     4898   11  0.79      0.35    0.33     0.55    0.05
winequality-red [28]       1599   11  0.65      0.33    0.32     0.5     0.03
yacht-hydrodynamics [29]   308    6   229.55    13.27   3.45     0.94    0.74


Table 3: Cross validation performance of random forests and the bias correction in 12 UCI regression
tasks. In the above Var(Y) gives the variance of the responses, RF.Err is the cross-validated MSE
for random forests, RFc.Err is the cross-validated MSE for bias-corrected random forests, RF.Imp =
1 - RF.Err/Var(Y) is the improvement of random forests relative to predicting a constant, RFc.Imp
= 1 - RFc.Err/RF.Err is the relative improvement of adding the bias correction.
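
The two derived columns follow directly from the definitions in the caption. As a small check, the sketch below recomputes them for two rows using the values printed in Table 3; the helper names are illustrative only.

```python
def rf_imp(var_y, rf_err):
    """RF.Imp = 1 - RF.Err / Var(Y): improvement over predicting a constant."""
    return 1.0 - rf_err / var_y

def rfc_imp(rf_err, rfc_err):
    """RFc.Imp = 1 - RFc.Err / RF.Err: relative improvement from the bias correction."""
    return 1.0 - rfc_err / rf_err

# (Var(Y), RF.Err, RFc.Err) as printed in Table 3
rows = {
    "airfoil": (46.95, 12.56, 7.29),
    "yacht-hydrodynamics": (229.55, 13.27, 3.45),
}
for name, (var_y, rf_err, rfc_err) in rows.items():
    print(name, round(rf_imp(var_y, rf_err), 2), round(rfc_imp(rf_err, rfc_err), 2))
```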


6 Case Studies 

In order to assess the impact of the proposed bias correction in real world data, we applied random
forests with and without the bias correction to 12 data sets in the UCI repository for which the
task was labelled as regression. A description of the processing for each case study can be found
in supplemental materials. In each data set, we applied 10-fold cross-validation to estimate the
predictive mean squared error of RF and RFc. For each cross-validation fold, we learned a random
forest using 1000 trees as implemented in the randomForest package [18] in R and employed a bias
correction using 2000 residual bootstrap trees. These results - reported in Table 3 - are insensitive
to using either 1000 or 5000 residual bootstrap trees. In most cases RFc reduced squared error
compared to RF by between 2 and 10 percent. However, some examples (airfoil, BikeSharing,
Concrete, yacht-hydrodynamics) saw very substantial MSE reductions (42%, 34%, 30% and 74%
respectively). The bias correction increased MSE by 1% in two examples. We omitted results for
forestfires in which RF performed no better than predicting a constant and where RFc increased
MSE by 7%.

It is difficult to draw broader patterns from these results. However, the bias correction appears
to help most in cases with large signal to noise ratios (using RF reduces MSE by a large amount
relative to predicting a constant), but its benefit is reduced when the dimension of the feature space is
very large.


7 Conclusion 

We have proposed a residual bootstrap bias correction to random forests and other ensemble methods
in machine learning. This correction can be calculated at no more than the same cost as
learning the original ensemble. We have shown that this procedure substantially reduces bias in
almost all problems - an important consideration when carrying out statistical inference. In some
regression problems, it can also lead to substantial reduction in predictive mean squared error. Our 
focus has been on the effect of this bias correction on RF and we have therefore not compared this 
performance to other methods. However, we expect that applying this correction to other learning 
algorithms would demonstrate similar results, although it may not be possible to do so without 
large computational overhead. 

Theoretically, we have shown that the Monte Carlo error in this correction can be ignored provided
more residual bootstrap samples are used than used to build the original ensemble. However,
we have not treated the properties of the bias correction under infinite resampling. In the case of
low-dimensional kernel smoothing with bandwidth $h$, the bias on the interior of the support of $X$
is $O(h^2)$ and the residual bootstrap proposed here will not change this (although an alternative
correction will). However, near the edge of covariate support, the residual bootstrap will decrease
the order of bias from $O(h)$ to $O(h^2)$. A possible explanation for the success of this correction is
that for tree-based methods in moderate dimensions, most covariate values are near the edge of
this support. We also believe that a central limit theorem can be obtained for $\hat{F}^c_{B,B_o}$, but doing so
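will need to account for the use of $\hat{F}_B$ to learn $\hat{F}^o_{B_o}$.


A Proof of Theorem 4.1

Proof. We begin by writing the prediction at $x$ from an individual tree as

$$T_b(x, \Omega_b) = \sum_{i=1}^{n} \frac{L(x, X_i, \Omega_b)}{N(x, \Omega_b)} Y_i = \sum_{i=1}^{n} W_i(x, \Omega_b) Y_i$$

where $\Omega_b$ is the realization of a random variable that describes both the selection of bootstrap or
subsamples used in learning the tree $T_b$ as well as any additional random variables involved in the
learning process (e.g. the selection of candidate split variables in RF). Here $L(x, X_i, \Omega_b)$ is the
indicator that $x$ and $X_i$ are in the same leaf of a tree learned with randomization parameters $\Omega_b$,
and $N(x, \Omega_b)$ is the number of observations in the same leaf as $x$. We will also write

$$W_i^B(x) = \frac{1}{B} \sum_{b=1}^{B} W_i(x, \Omega_b)$$

as the average weight on $Y_i$ across all resamples so that

$$\hat{F}_B(x) = \sum_{i=1}^{n} W_i^B(x) Y_i.$$

Note that

$$\sum_{i=1}^{n} W_i(x, \Omega_b) = \sum_{i=1}^{n} W_i^B(x) = 1.$$

We can similarly write a residual-bootstrap tree as

$$T_b^o(x, \Omega_b^o) = \sum_{i=1}^{n} \sum_{j=1}^{n} V_{ij}(x, \Omega_b^o) \left[ \hat{F}_B(X_i) + \hat{\epsilon}_j \right] = \sum_{i=1}^{n} \sum_{j=1}^{n} V_{ij}(x, \Omega_b^o) \left[ \hat{F}_B(X_i) + (Y_j - \hat{F}_B(X_j)) \right]$$

with the corresponding quantities

$$V_{ij}^{B_o}(x) = \frac{1}{B_o} \sum_{b=1}^{B_o} V_{ij}(x, \Omega_b^o)$$

where we also have

$$\sum_{i=1}^{n} \sum_{j=1}^{n} V_{ij}(x, \Omega_b^o) = \sum_{i=1}^{n} \sum_{j=1}^{n} V_{ij}^{B_o}(x) = 1.$$

Using these quantities we can write

$$\hat{F}^c_{B,B_o}(x) = \sum_{i=1}^{n} 2 W_i^B(x) Y_i - \sum_{i=1}^{n} \sum_{j=1}^{n} V_{ij}^{B_o}(x) \left[ \hat{F}_B(X_i) + (Y_j - \hat{F}_B(X_j)) \right]$$

$$= \sum_{i=1}^{n} 2 W_i^B(x) Y_i - \sum_{i=1}^{n} \sum_{j=1}^{n} V_{ij}^{B_o}(x) Y_j + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} V_{ij}^{B_o}(x) \left( W_k^B(X_j) - W_k^B(X_i) \right) Y_k.$$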

Hence letting $\mathrm{Var}_\Omega(w(x, \Omega))$ indicate variance with respect to only the randomization parameters
$\Omega$, writing $Y_i = F(X_i) + \epsilon_i$ and observing that $0 \le W_i(x, \Omega) \le 1$, $0 \le V_{ij}(x, \Omega) \le 1$:


2 g / ri \2 inn 

E < —UyVarn ( E j + EE Uy(x,u)y, 


i=l j=l 


71 71 71 


+ —ErYarn^o^nb IEEE (wfix.Mt) - wf{x„n,)) u 

i=l j=l k=l 


, 8 2 
-[b^K 
2 

< ^ [l|f|lL + '^"(l + 'llog(n))] + ^ [||F||^+ »»(! +41og(n))]. 


2max(_F(Xi) — F{Xj))'^ + 2max(ei — CjY 


2inax{2F{Xi) — 2E(Xj))‘^ + 2max{2ei — 2tjY 


Here we use the fact that for $\epsilon_1, \ldots, \epsilon_n \sim N(0, 1)$, $E \max_i \epsilon_i^2 \le 1 + 4 \log(n)$.


□ 


B Details of Case Study Data Sets 

After processing each data set as described below, we employed 10-fold cross-validation to obtain
cross-validated squared error for both $\hat{F}_B$ and $\hat{F}^c_{B,B_o}$, removing the final data entries to create
equal-sized folds. To maintain comparability, the same folds were used for both estimates. We set
$B = 1000$ and $B_o = 2000$, but these results were insensitive to setting $B_o = 1000$ or $B_o = 5000$.
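
The fold construction can be sketched as below. The function name is hypothetical, and random assignment of the retained rows to folds is an assumption: the text above specifies only that the final entries were removed so that the folds have equal size.

```python
import random

def equal_folds(n, k=10, rng=None):
    """Split indices into k equal-sized folds after dropping the final n % k entries."""
    rng = rng or random.Random()
    kept = list(range(n - n % k))   # remove the final data entries
    rng.shuffle(kept)               # assumption: random assignment to folds
    size = len(kept) // k
    return [kept[f * size:(f + 1) * size] for f in range(k)]

# airfoil has 1503 rows, so the last 3 are dropped and each fold holds 150 rows
folds = equal_folds(1503, rng=random.Random(0))
```

Reusing the same list of folds for both estimators keeps the two cross-validated errors directly comparable.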

Below we detail each data set and the processing steps taken for it; unless processing is noted,
data were taken as is from the UCI repository.

airfoil 42% improvement over RF. Task is to predict sound pressure in decibels of airfoils at
various wind tunnel speeds and angles of attack [19]. 1503 observations, 5 features.

auto-mpg 6% improvement over RF. Task is to predict city-cycle fuel consumption in miles
per gallon from physical car and engine characteristics [20]. Rows missing horsepower were
removed resulting in 392 examples with 8 features, 3 of which are discrete.

BikeSharing-hour 34% improvement over RF. Prediction of number of rental bikes used each
hour in a bike-sharing system [21]. Date and Season (columns 2 and 3) were removed
from features as duplicating information, leaving 13 covariates related to time, weather and
number of users. 17379 examples; prediction task was for log counts.

communities -1% improvement over RF. Prediction of per-capita rate of violent crime in U.S.
cities [23]. 1993 examples, 96 features. 30 (out of original 125) features removed due to
high-missingness including state, county and data associated with police statistics. One row
(Natchezcity) deleted due to missing values. Cross-validation was done using independently-generated
folds.

CCPP 8% improvement over RF. Prediction of net hourly output from Combined Cycle Power
Plants [22]. 4 features and 9568 examples.

Concrete 30% improvement over RF. Prediction of concrete compressive strength from constituent
components [24]. 9 features, 1030 examples.

forestfires -8% improvement over RF. Prediction of log(area+1) burned by forest fires from
location, date and weather attributes [30]. 517 examples, 13 features. Not reported in main
paper because Random Forests predictions had 15% higher squared error than a constant
prediction function.

housing 9% improvement over RF. Predict median housing prices from demographic and geographic
features for suburbs of Boston [25]. Response was taken to be the log of median
house prices. 506 examples, 14 attributes.

parkinsons 3% improvement over RF. Prediction of Motor UPDRS from voice monitoring data
in early-stage Parkinsons patients [26]. Removed features for age, sex, test-time and Total
UPDRS, resulting in 15 features and 5875 examples.

SkillCraft -1% improvement over RF. Predict league index of gamers playing SkillCraft based on
playing statistics [27]. Entries with NA's removed; results in 3338 examples and 18 features.

winequality-white 5% improvement over RF. Predict expert quality score on white wines based
on 11 measures of wine composition [28]. 4898 examples.

winequality-red 3% improvement over RF. As in winequality-white for red wines [28]. 1599
examples.

yacht-hydrodynamics 74% improvement over RF. Predict residuary resistance per unit weight
of displacement of sailing yachts from hull geometry [29]. 308 examples, 7 features.